Análisis Genómicos y Transcriptómicos con plataforma NGS
Tabla de Contenidos
- Lecturas cortas ‘short reads’
- Alineamiento frente a genoma de referencia
- alineamiento
- indexar genoma de referencia
- mapear lecturas (
.fastq) .sam->.bam
- visualización con IGV
- alineamiento
- Análisis de Expresión Diferencial
- alineamiento
- gene quantification
- DESeq workflow
- organize the data
- DESeq2 object
- normalize raw counts
- clustering analyses
- model fitting
- model contrast
- resuls table
- visualizing results
- DEG annotation
Enlaces R & Bioconductor
Instalación de paquetes desde CRAN
install.packages("R.utils") # <- uncompress .gz files
install.packages("BiocManager") # <- Bioconductor manager
Instalación de Bioconductor
BiocManager::install()
Verificar la version de Bioconductor que estamos usando y si los paquetes están actualizados:
BiocManager::version()
BiocManager::valid()
Los módulos de trabajo son lo siguientes:
- análisis de secuencias cortas (
shortRead) - mapeo de secuencias cortas a un genoma (
alignment) - análisis de expresión diferencial (
DEA)
ShortRead
*Analisis de genomas y transcriptomas con plataforma NGS
*Master Biotecnologia
*2025
Jose V. Die, Dept. Genetics, UCO
jose.die@uco.es
Instalación del paquete que usaremos en este módulo.
BiocManager::install("ShortRead") # <- manipulate FASTQ filesUna vez instalados los paquetes necesarios para este módulo, cargamos las librerías.
# Dependencies
library(R.utils)
library(ShortRead)Descarga ficheros
Estructura del proyecto
dir.create("sreads") # create a directory/folder in the indicated path
dir.create("sreads/fastq") # create a directory/folder in the indicated path
dir.create("sreads/genome") # create a directory/folder in the indicated path
dir.create("sreads/results") # create a directory/folder in the indicated path
dir("sreads") # see files and folders in the indicated path)
Descarga ficheros secuencia genómica
download.file("https://www.dropbox.com/s/4ft480eky7kghzw/Saccharomyces_cerevisiae_genome.fa.gz?dl=1",
destfile = "sreads/genome/Saccharomyces_cerevisiae_genome.fa.gz")
Descarga de secuencias cortas
download.file(url = "https://www.dropbox.com/scl/fi/662wqxas24h0mdzadfffh/ENASRR23944525.fastq?rlkey=a3u8htlieoyeh4o9dgplax831&dl=1",
destfile = "sreads/fastq/SRR23944525.fastq.gz")
Descompresión de las secuencias fasta.gz y fastq.gz
# Uncompress fasta.gz & fastq.gz files
R.utils::gunzip("sreads/genome/Saccharomyces_cerevisiae_genome.fa.gz", remove = FALSE)
R.utils::gunzip("sreads/fastq/SRR23944525.fastq.gz", remove = FALSE)
En este módulo queremos tener un visión general de las posibilidades que ofrece el paquete ShortReads para el tratamiento de secuencias cortas (ej. lecturas generadas por un secuenciador tipo Illumina). Estas lecturas se presentan en un fichero con formato fastq pero el mismo paquete permite también el tratamiento de secuencias almacenadas en un fichero fasta
formato fasta
# # Read fasta : Saccharomyces genome
fa_sample <- readFasta(dirPath = "sreads/genome/", pattern = "genome")
fa_sample <- readFasta(dirPath = "sreads/genome/", pattern = ".fa$")# # Print sample
fa_sample## class: ShortRead
## length: 17 reads; width: 85779..1531933 cycles
Podemos acceder a las funciones (methods) disponibles para un objeto de clase ShortRead con la siguiente llamada:
# # Methods accessors
methods(class = "ShortRead")## [1] [ alphabetByCycle append clean
## [5] detail dustyScore id length
## [9] narrow pairwiseAlignment show srdistance
## [13] srduplicated sread srorder srrank
## [17] srsort tables trimEnds trimLRPatterns
## [21] width writeFasta
## see '?methods' for accessing help and source code
Veamos el resultado de aplicar alguno de los métodos disponibles :
length(fa_sample)## [1] 34
width(fa_sample)## [1] 230218 813184 316620 1531933 576874 270161 1090940 562643 439888
## [10] 745751 666816 1078177 924431 784333 1091291 948066 85779 230218
## [19] 813184 316620 1531933 576874 270161 1090940 562643 439888 745751
## [28] 666816 1078177 924431 784333 1091291 948066 85779
id(fa_sample)## BStringSet object of length 34:
## width seq
## [1] 52 I dna:chromosome chromosome:R64-1-1:I:1:230218:1 REF
## [2] 54 II dna:chromosome chromosome:R64-1-1:II:1:813184:1 REF
## [3] 56 III dna:chromosome chromosome:R64-1-1:III:1:316620:1 REF
## [4] 55 IV dna:chromosome chromosome:R64-1-1:IV:1:1531933:1 REF
## [5] 52 V dna:chromosome chromosome:R64-1-1:V:1:576874:1 REF
## ... ... ...
## [30] 58 XIII dna:chromosome chromosome:R64-1-1:XIII:1:924431:1 REF
## [31] 56 XIV dna:chromosome chromosome:R64-1-1:XIV:1:784333:1 REF
## [32] 55 XV dna:chromosome chromosome:R64-1-1:XV:1:1091291:1 REF
## [33] 56 XVI dna:chromosome chromosome:R64-1-1:XVI:1:948066:1 REF
## [34] 57 Mito dna:chromosome chromosome:R64-1-1:Mito:1:85779:1 REF
sread(fa_sample)## DNAStringSet object of length 34:
## width seq
## [1] 230218 CCACACCACACCCACACACCCACACACCACAC...TGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
## [2] 813184 AAATAGCCCTCATGTACGTCTCCTCCAAGCCC...GTGGTGTGTGTGGGTGTGGTGTGTGGGTGTGT
## [3] 316620 CCCACACACCACACCCACACCACACCCACACA...GGTGTGTGGGTGTGGTGGGTGTGGTGTGTGTG
## [4] 1531933 ACACCACACCCACACCACACCCACACACACCA...AACATAAAATAAAGGTAGTAAGTAGCTTTTGG
## [5] 576874 CGTCTCCTCCAAGCCCTGTTGTCTCTTACCCG...AAGAACAGGGTTTCATTTTCATTTTTTTTTTT
## ... ... ...
## [30] 924431 CCACACACACACCACACCCACACCACACCCAC...GGGTGTGGTGTGGGTGTGGTGTGTGTGTGGGG
## [31] 784333 CCGGCTTTCTGACCGAAATTAAAAAAAAAAAA...GGTGTGTGGGTGTGTGTGGGTGTGGTGTGGGT
## [32] 1091291 ACACCACACCCACACCACACCCACACCCACAC...TAGATGTGAGAGAGTGTGTGGGTGTGGTGTGT
## [33] 948066 AAATAGCCCTCATGTACGTCTCCTCCAAGCCC...TTTCATTTTTTTTTTTTAATTTCGGTCAGAAA
## [34] 85779 TTCATAATTAATTTTTTATATATATATTATAT...GAAATATGCTTAATTATAATATAATATCCATA
Podemos extraer la secuencia de nucleótidos de un objeto de clase DNAStringSet :
# # Extract sequence from DNAStringSet object (100 first letters)
substr(toString(sread(fa_sample)[1]), 1, 100)## [1] "CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG"
Otra forma de extraer la secuencia :
as.character(sread(fa_sample)[1])
El objeto de clase ShortRead puede escribirse en formato fasta :
# # Write a ShortRead object
writeFasta(object = fa_sample, file = "sreads/results/Scerevisiae_genome.fasta")
formato fastq
# # Read fastq: Beauveria bassiana reads
#fq_sample <- readFastq(dirPath = "fastq", pattern = "fastq")
fq_sample <- readFastq(dirPath = "sreads/fastq/SRR23944525.fastq.gz")# # Print sample
fq_sample## class: ShortReadQ
## length: 603239 reads; width: 35..76 cycles
# counts the nubmer of records, nucleotides, and base-level quality scores
countFastq(dirPath = "sreads/fastq/SRR23944525.fastq")## records nucleotides scores
## SRR23944525.fastq 603239 45656825 45656825
Podemos acceder a las funciones (methods) disponibles para un objeto de clase ShortReadQ con la siguiente llamada:
# # Methods accessors
methods(class = "ShortReadQ")## [1] [ [<- alphabetByCycle alphabetScore
## [5] append clean coerce detail
## [9] dustyScore id length narrow
## [13] pairwiseAlignment qa reverse reverseComplement
## [17] show srdistance srduplicated sread
## [21] srorder srrank srsort tables
## [25] trimEnds trimLRPatterns trimTails trimTailw
## [29] width writeFasta writeFastq
## see '?methods' for accessing help and source code
Veamos el resultado de aplicar alguno de los métodos disponibles :
sread(fq_sample)[1:20]## DNAStringSet object of length 20:
## width seq
## [1] 75 GGTGCAAGGGGGTCGGAGTTGATGGCGGAGTGG...CATCGGCAGGCTGCGTTGGCAGATGCCTTCTTT
## [2] 76 CAATACTCGGTCAAAAAAAAAAAAAAAAAAGGA...GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
## [3] 76 CAAAGTGCTTGGGCTCCAGGGGGAGTATGGTCG...ACTTAAAAAAAAAAAAAAGATCGGAAGAGGGGG
## [4] 75 CTGGCTACGAGAGCATCAAGCGCGTGGCTTTGA...TCAAGCCGCGCGTGCGCTGGTGACGCGACGTCG
## [5] 74 GGTGAGGCCGGCTATCTGAGTTTCATCGTGCAA...CCACAATTGACGTTGCATTCAACGTCTGTGCAT
## ... ... ...
## [16] 76 AGTGAGGTGTTGGCGGCGTTGCAGATGAGCATA...GCCAAGTCGAGTAACGAGAGATGTGGCGAGTCA
## [17] 75 ACATGCTTCGAGGTCCAAGCTGTGAGGTTTACT...AGGGATACCCGGATGGCGTTGATGAATTATGGG
## [18] 74 CGGGCCCGAGATTCCAGTGTTTGGCATTGACAT...ACGCCTACTACTTGCAGTATCTCAACGGCAAGG
## [19] 76 TTATTAAATAAAAAAAAAACGTATACTACCCTT...TATAACTTTACAGCCAATTTCACAATAATATTA
## [20] 76 GAGGAGTGGCGCAAGTAGCATCGGCAGGCTGCG...CCTTCTTTCTATGCAATCTGATATAGTATGAAA
width(fq_sample)[1:20]## [1] 75 76 76 75 74 76 76 76 74 75 76 76 76 76 76 76 75 74 76 76
sum(width(fq_sample)) / 1e6 ## [1] 45.65682
# comparar con slide "entrez" : 45.7M
# https://www.ncbi.nlm.nih.gov/sra/SRR23944525hist(width(fq_sample), main = "Reads length")
abline(v = 75, col = "red", lwd = 2)Podemos filtrar las lecturas que tengan un tamaño mínimo :
fqsample_size <- fq_sample[width(fq_sample) >= 75]
hist(width(fqsample_size), main = "Reads_filtered length")Número de lecturas originales y lecturas filtradas por tamaño:
length(fq_sample) / 1e3## [1] 603.239
length(fqsample_size) / 1e3## [1] 567.342
El objeto de clase ShortReadQ puede escribirse en formato fastq:
# # Write a ShortRead object
writeFasta(object = fqsample_size, file = "sreads/results/reads75.fastq")
Quality assessment
# get quality scores per base (for all sequences in the fastq files!) as a matrix
qPerBase = as(quality(fq_sample), "matrix")Ahora podemos repasar la calidad de una o de varias secuencias. Por ejemplo, los primeros 20 nucleótidos de las 3 primeras lecturas que aparecen en el fichero .fastq:
(Remember : a score of 30 is considered a good quality as it means the accuracy of base call is 99.9%)
# Check quality of one/some sequence(s)
qPerBase[1:3, 1:20] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 32 32 32 32 32 36 36 36 36 36 36 36 36 36
## [2,] 14 32 32 14 32 14 14 32 14 36 36 36 36 32
## [3,] 21 32 32 32 21 36 14 14 36 36 21 21 21 32
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 36 36 36 36 36 36
## [2,] 14 14 36 36 14 36
## [3,] 14 36 36 36 14 14
Podemos filtrar las lecturas que tengan una calidad mínima :
- por cada secuencia : sumar bases < calidad mínima
- extraer las secuencias donde el paso anterior = 0
# get number of bases per read that have quality score below 30
qcount = rowSums(qPerBase < 30, na.rm = TRUE)
# Number of reads where all Phred scores > 30
# qcount == 0 , meaning all quality scores per base in the read >= 30
fqsample_qfiltered = fq_sample[qcount == 0]Podemos filtrar las lecturas que tengan un tamaño mínimo :
fqsample_size <- fq_sample[width(fq_sample) >= 75]
hist(width(fqsample_size), main = "Reads_filtered length", xlab = "width (bp)")Número de lecturas originales y lecturas filtradas por tamaño:
## N. total de lecturas originales: 603.239
## N. total de lecturas filtradas por tamaño: 567.342
El objeto de clase ShortReadQ puede escribirse en formato fastq:
# # Write a ShortRead object
writeFasta(object = fqsample150, file = "sreads/results/reads150.fastq")
Quality assessment
# get quality scores per base (for all sequences in the fastq files!) as a matrix
qPerBase = as(quality(fq_sample), "matrix")Ahora podemos repasar la calidad de una o de varias secuencias. Por ejemplo, los primeros 20 nucleótidos de las 3 primeras lecturas que aparecen en el fichero .fastq:
(Remember : a score of 30 is considered a good quality as it means the accuracy of base call is 99.9%)
# Check quality of one/some sequence(s)
qPerBase[1:3, 1:20] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 32 32 32 32 32 36 36 36 36 36 36 36 36 36
## [2,] 14 32 32 14 32 14 14 32 14 36 36 36 36 32
## [3,] 21 32 32 32 21 36 14 14 36 36 21 21 21 32
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 36 36 36 36 36 36
## [2,] 14 14 36 36 14 36
## [3,] 14 36 36 36 14 14
Podemos filtrar las lecturas que tengan una calidad mínima :
- por cada secuencia : sumar bases < calidad mínima
- extraer las secuencias donde el paso anterior = 0
# get number of bases per read that have quality score below 30
qcount = rowSums(qPerBase < 30, na.rm = TRUE)
# Number of reads where all Phred scores > 30
# qcount == 0 , meaning all quality scores per base in the read >= 30
fqsample_qfiltered = fq_sample[qcount == 0]Calidad general de las lecturas : representación visual.
#fastq original reads (non-filtered):
qaSummary = qa(fq_sample, lane = 1)
df = qaSummary[["readQualityScore"]]
#fastq filtered reads by quality
qaSummary_q = qa(fqsample_qfiltered, lane = 1)
df_q = qaSummary_q[["readQualityScore"]]
# Plotting base quality
par(mfrow=c(1,2))
ShortRead:::.plotReadQuality(df[df$type=="read",])ShortRead:::.plotReadQuality(df_q[df$type=="read",]) (Lanes with consistently good quality reads have strong peaks at the right of the panel).
De nuevo, podemos escribir en formato fastq el objeto de clase ShortReadQ que contiene las lecturas filtradas con la calidad mínima seleccionada :
writeFastq(object = fqsample_qfiltered,
file = "sreads/results/reads_Qfiltered.fastq")
Calidad general de las lecturas : número de secuencias.
## N. total de lecturas en dataset original: 603.239
## N. total de lecturas en dataset filtrado por calidad: 159.777
Calidad general de las lecturas : estadística descriptiva.
# summaries of reads in non-filtered and filtered datasets
summary((width(fq_sample)))## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 76.00 76.00 75.69 76.00 76.00
summary((width(fqsample_qfiltered)))## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64.00 76.00 76.00 75.67 76.00 76.00
# histograms of reads in non-filtered and filtered datasets
par(mfrow=c(1,2))
hist(width(fq_sample), main = "non-filtered reads", xlab = "length")
hist(width(fqsample_qfiltered), main = "filtered reads", xlab = "length")Alineamiento
*Analisis de genomas y transcriptomas con plataforma NGS
*Master Biotecnología
*2025
Jose V. Die, Dept. Genetics, UCO
jose.die@uco.es
En este módulo trabajaremos con lecturas de Saccharomyces cerevisiae generadas en un experimento RNA-Seq. Las lecturas obtenidas son del tipo paired-end reads donde cada fragmento está secuenciado por los dos extremos, generando una estructura de ficheros de este tipo:
- FFFF_1.fastq.gz
- FFFF_2.fastq.gz
Además, trabajaremos con un formato nuevo de fichero (.gff3) que contiene información de anotación genómica.
Formatos ficheros :
- .FASTA (.FA,.FNA): biological sequences (eg. chromosomes)
- .FASTQ (.FQ): short-reads (eg. illumina)
- .GFF (.GFF3, .GTF): genomic elements coordinates (eg. genes)
Para completar este módulo se realizarán las siguientes tareas :
- descarga anotación genoma S. cerevisiae (
.gff3) - descarga de secuencias RNA-Seq (SRR9336468)
- descompresión de ficheros (
.gz) - alineamiento BLAST
- genome mapping alignment
- index the reference genome
- align fastq files
- convert
.samfiles into.bamfiles
- index the reference genome
- IGV : visual exploration of genomic data
- filtering reads by mapping quality
Mapeador : Bowtie2
- publicación
- Bowtie2 software homepage
Visualización datos genómicos : IGV
Instalación del paquete que usaremos en este módulo.
BiocManager::install("Rbowtie2") # <- align FASTQ files
BiocManager::install("Rsamtools") # <- manipulate SAM filesUna vez instalados los paquetes necesarios para este módulo, cargamos las librerías.
# Dependencies
library(Rbowtie2)
library(Rsamtools)Descarga ficheros
Estructura del proyecto
# check folders - if FALSE, create the folders
dir.exists("mapping")
dir.exists("mapping/fastq")
dir.exists("mapping/genome")
dir.exists("mapping/bam")
dir("mapping") # see files and folders in the indicated path)Descarga ficheros secuencia genómica
# download if the file is not in the folder
if (!file.exists("mapping/genome/Saccharomyces_cerevisiae_genome.gff3")) {
download.file("https://www.dropbox.com/s/qtaret1hrbvw2xb/Saccharomyces_cerevisiae_genome.gff3.gz?dl=1",
destfile = "mapping/genome/Saccharomyces_cerevisiae_genome.gff3.gz")
}
También necesitamos el fichero que contiene la secuencia del genoma de S. cerevisiae (fasta). Como lo descargamos en el módulo anterior, podemos copiarlo desde allí.
file.copy(from = "sreads/genome/Saccharomyces_cerevisiae_genome.fa", to = "mapping/genome/" )Descarga de secuencias cortas
download.file("https://www.dropbox.com/s/v06um7vt9ojdf42/NGS_MB_SRR9336468_1.fastq.gz?dl=1",
destfile = "mapping/fastq/1M_SRR9336468_1.fastq.gz")
download.file("https://www.dropbox.com/s/kgxfth8ra675ccu/NGS_MB_SRR9336468_2.fastq.gz?dl=1",
destfile = "mapping/fastq/1M_SRR9336468_2.fastq.gz")Descompresión de las secuencias fasta.gz y fastq.gz
# Uncompress fasta.gz & fastq.gz files
#R.utils::gunzip("mapping/genome/Saccharomyces_cerevisiae_genome.fa.gz", remove=TRUE)
R.utils::gunzip("mapping/genome/Saccharomyces_cerevisiae_genome.gff3.gz", remove=TRUE)
R.utils::gunzip("mapping/fastq/1M_SRR9336468_1.fastq.gz", remove=TRUE)
R.utils::gunzip("mapping/fastq/1M_SRR9336468_2.fastq.gz", remove=TRUE)En una consola LINUX podemos observar el fichero de anotación gff3 y los ficheros fastq.
system("grep -v '#' genes/Saccharomyces_cerevisiae_genome.gff3")
# # Total number of reads ?
system("wc -l fastq/1M_SRR9336468_1.fastq")# # Look at the two reads from the same cDNA fragment
system("head -n 4 fastq/1M_SRR9336468_1.fastq")BLAST genome
A modo ilustrativo, vamos a alinear la primera secuencia de cada fichero contra la secuencia de referencia del genoma de Saccharomices.
Comenzaremos con la secuencia @SRR9336468.1 1/2 y analizaremos en el resultado del BLAST el sitio único del alineamiento o mapeo, las coordenadas del intervalo, la coordenada de la primera base de la secuencia y la orientación de la cadena.
Del mismo modo, seguiremos con la secuencia @SRR9336468.1 1/1.
Run blast genome : saccharomices
Genome mapping alignment
- Index the reference genome
# # use the indexer `bowtie2_build`
bowtie2_build(references = "mapping/genome/Saccharomyces_cerevisiae_genome.fa",
bt2Index = "mapping/genome/Saccha_genome",
overwrite = TRUE)# # create `.fai` file with `indexFa` from `Rsamtools`
indexFa("mapping/genome/Saccharomyces_cerevisiae_genome.fa")- Align
.fastqfiles against the indexed genome
# # align fastq files against the genome with `bowtie2`
bowtie2(bt2Index = "mapping/genome/Saccha_genome",
samOutput = "mapping/bam/SRR9336468.sam",
seq1 = "mapping/fastq/1M_SRR9336468_1.fastq",
seq2 = "mapping/fastq/1M_SRR9336468_2.fastq",
overwrite = TRUE,
"--threads=3")
# Time difference of 2.249662 mins- Convert
.samfiles intobamfiles (more compact and indexed)
# convert SAM file into BAM file (and indexes .bai) with "asBam"
asBam("mapping/bam/SRR9336468.sam")
# Time difference of 1.345935 mins Una vez se tiene el .bam, el fichero .samse puede borrar ya que los ficheros .sam son los más prescindibles y los que más ocupan.
file.remove("mapping/bam/SRR9336468.sam") IGV
- Load genome : Genome / Local File / chromosomes / .fa + .fai
- Load genes : Tracks / Local File (gff3)
- View options
- Load reads : Tracks / Load File / select bam + index (bai)
Filtering reads by mapping quality
Read the .bam file
bamfile <- BamFile("mapping/bam/SRR9336468.bam")
bamfile## class: BamFile
## path: mapping/bam/SRR9336468.bam
## index: NA
## isOpen: FALSE
## yieldSize: NA
## obeyQname: FALSE
## asMates: FALSE
## qnamePrefixEnd: NA
## qnameSuffixStart: NA
Podemos acceder a las funciones (methods) disponibles para un objeto de clase BamFile con la siguiente llamada:
# # Methods accessors
methods(class = "BamFile")## [1] $ $<- asMates
## [4] asMates<- assayData<- close
## [7] coerce contents countBam
## [10] coverage dbconn dbfile
## [13] ExpressionSet filterBam findSpliceOverlaps
## [16] idxstatsBam index index<-
## [19] indexBam initialize isIncomplete
## [22] isOpen obeyQname obeyQname<-
## [25] open path pileup
## [28] qnamePrefixEnd qnamePrefixEnd<- qnameSuffixStart
## [31] qnameSuffixStart<- quickBamFlagSummary readGAlignmentPairs
## [34] readGAlignments readGAlignmentsList readGappedReads
## [37] scanBam scanBamHeader seqinfo
## [40] show sortBam summarizeOverlaps
## [43] testPairedEndBam updateObject yieldSize
## [46] yieldSize<-
## see '?methods' for accessing help and source code
seqinfo(bamfile)## Seqinfo object with 17 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## I 230218 <NA> <NA>
## II 813184 <NA> <NA>
## III 316620 <NA> <NA>
## IV 1531933 <NA> <NA>
## V 576874 <NA> <NA>
## ... ... ... ...
## XIII 924431 <NA> <NA>
## XIV 784333 <NA> <NA>
## XV 1091291 <NA> <NA>
## XVI 948066 <NA> <NA>
## Mito 85779 <NA> <NA>
Set the filter condition
# # build the filtered : .bam + .bai files
param = ScanBamParam(mapqFilter = 35) # min quality = 35 *
dest = filterBam(file = bamfile, destination = "mapping/bam/mapq.bam", param = param)
# en el próximo módulo veremos que también podemos controlar la calidad del alineamiento `Deseq2` con la `featureCounts` y la variable 'minMQS'Count number of alignments
bamq <- BamFile("mapping/bam/mapq.bam")
countBam(bamfile)## space start end width file records nucleotides
## 1 NA NA NA NA SRR9336468.bam 2000000 3e+08
countBam(bamq)## space start end width file records nucleotides
## 1 NA NA NA NA mapq.bam 1619149 242872350
Manipulating .bamfiles with scanBam
# # scanBam generates a list (length = 1) with 13 elements
aln <- scanBam(bamfile)
aln35 = scanBam(bamq)# # sequence reads
aln[[1]]$seq[1:10]
aln35[[1]]$seq[1:10]# # mapping quality reads
aln[[1]]$mapq[1:10]
aln35[[1]]$mapq[1:10]# # descriptive statistics
summary(aln[[1]]$mapq)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 42.00 42.00 37.37 42.00 42.00 104977
summary(aln35[[1]]$mapq)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 42.00 42.00 41.85 42.00 42.00
# # check number of reads based on a condition
sum(aln[[1]]$mapq < 35, na.rm = TRUE)## [1] 275874
sum(aln35[[1]]$mapq < 35, na.rm = TRUE)## [1] 0
DEA
*Analisis de genomas y transcriptomas con plataforma NGS
*Master Biotecnología
*2025
Jose V. Die, Dept. Genetics, UCO
jose.die@uco.es
En este módulo trabajaremos con las lecturas de Saccharomyces cerevisiae generadas en el experimento RNA-Seq del módulo anterior. Las lecturas obtenidas son del tipo paired-end reads donde cada fragmento está secuenciado por los dos extremos. El trabajo original puede consultarse en PubMed.
La tabla de metadatos es la siguiente : Bioproject : PRJNA550078.
Publicacion : 2020-02-27.
| SampleID | Description | Replicate | Millions of Spots | Biosample |
|---|---|---|---|---|
| SRR9336468 | pH5 CO2 0.04% | R1 | 23.8 | SAMN12107229 |
| SRR9336469 | pH5 CO2 0.04% | R2 | 21.9 | SAMN12107228 |
| SRR9336470 | pH5 CO2 0.04% | R3 | 21.6 | SAMN12107227 |
| SRR9336471 | pH5 CO2 50% | R1 | 23.1 | SAMN12107226 |
| SRR9336472 | pH5 CO2 50% | R2 | 23.4 | SAMN12107225 |
| SRR9336473 | pH5 CO2 50% | R3 | 20.5 | SAMN12107224 |
| SRR9336474 | pH3 CO2 0.04% | R1 | 23.0 | SAMN12107223 |
| SRR9336475 | pH3 CO2 0.04% | R2 | 20.9 | SAMN12107248 |
| SRR9336476 | pH3 CO2 0.04% | R3 | 21.4 | SAMN12107244 |
La primera parte del módulo ya se ha realizado en la sección del alineamiento. Para completar este módulo se realizarán las siguientes tareas :
- descarga anotación genoma S. cerevisiae (
.gff3) - descarga de secuencias RNA-Seq (SRR9336468)
- descompresión de ficheros (
.gz) - genome mapping alignment
- index the reference genome
- align fastq files
- convert
.samfiles into.bamfiles
- index the reference genome
- descarga
.bamfiles - organize the data
- DESeq2 object
- normalize raw counts
- clustering analyses
- model fitting
- model contrast
- resuls table
- visualizing results
- DEG annotation
Papers de interés:
Instalación de los paquetes que usaremos en este módulo.
BiocManager::install("Rsubread") # <- quantification of RNA-seq reads
BiocManager::install("DESeq2") # <- method for DEAUna vez instalados los paquetes necesarios para este módulo, cargamos las librerías.
# Dependencies
library(Rsamtools)
library(Rsubread)
library(DESeq2)Descarga ficheros
Estructura del proyecto
# check folders - if FALSE, create the folders
dir.exists("dea")
dir.exists("dea/genome")
dir.exists("dea/fastq")
dir.exists("dea/bam")
dir("dea") # see files and folders in the indicated path)Descarga ficheros secuencia genómica
Necesitamos los ficheros de la secuencia del genoma de S. cerevisiae (fasta) y el fichero de anotación (.gff3) que descargamos en el módulo anterior. Podemos copiarlos desde allí.
file.copy(from = "mapping/genome/Saccharomyces_cerevisiae_genome.fa", to = "dea/genome")
file.copy(from = "mapping/genome/Saccharomyces_cerevisiae_genome.fa.fai", to = "dea/genome")
file.copy(from = "mapping/genome/Saccharomyces_cerevisiae_genome.gff3", to = "dea/genome")dir("dea/genome")## [1] "Saccharomyces_cerevisiae_genome.fa"
## [2] "Saccharomyces_cerevisiae_genome.gff3"
Descarga de secuencias cortas Las lecturas de secuencia corta que necesitamos son las que hemos descargado en el módulo anterior y podemos copiarlas desde allí.
file.copy(from = "mapping/fastq/1M_SRR9336468_1.fastq", to = "dea/fastq")
file.copy(from = "mapping/fastq/1M_SRR9336468_2.fastq", to = "dea/fastq")dir("dea/fastq")## [1] "1M_SRR9336468_1.fastq" "1M_SRR9336468_2.fastq"
Genome mapping alignment
Este paso reproduce el alineamiento realizado en el módulo anterior módulo anterior con las secuencias ‘1M_SRR9336468_1.fastq’ y ‘1M_SRR9336468_2.fastq’ :
- Index the reference genome
- Align
.fastqfiles against the indexed genome - Convert
.samfile into.bamfile
El fichero ‘SRR9336468.bam’ que genera este proceso podemos copiarlo desde el caso anterior. En esta ocasión, vamos a llamarle ‘basal_R1.bam’.
file.copy(from = "mapping/bam/SRR9336468.bam", to = "dea/bam/basal_R1.bam")
file.copy(from = "mapping/bam/SRR9336468.bam.bai", to = "dea/bam/basal_R1.bam.bai")Download the rest of BAM files :
download.file("https://www.dropbox.com/s/iez6egp57hazmyc/bam_files.zip?dl=1",
destfile = "dea/bam/bam_files.zip")
#Time difference of 1.007498 minsUncompress bam_files.zip
unzip("dea/bam/bam_files.zip", exdir = "dea/bam" )
The zip file = 872.219 MB. However the dowloaded file = 659.2 MB. When the downloading step is not completed, for some reason, unzip does not work. In order to avoid problems, download the .zip file and uncompress manually.
Gene quantification
Quantify “genes” with `featureCounts .
# store all paths of bam files into "bamFiles" R object (use `list.files`)
bamFiles <- list.files(path = "dea/bam", pattern = "*.bam$", full.names = TRUE)
bamFiles## [1] "dea/bam/basal_R1.bam" "dea/bam/basal_R2.bam" "dea/bam/basal_R3.bam"
## [4] "dea/bam/indust1_R1.bam" "dea/bam/indust1_R2.bam" "dea/bam/indust1_R3.bam"
## [7] "dea/bam/indust2_R1.bam" "dea/bam/indust2_R2.bam" "dea/bam/indust2_R3.bam"
# quantify 'genes'
data <- Rsubread::featureCounts(
files = bamFiles,
annot.ext = "dea/genome/Saccharomyces_cerevisiae_genome.gff3",
isGTFAnnotation = TRUE,
GTF.featureType = "gene",
GTF.attrType = "ID",
isPairedEnd = TRUE,
requireBothEndsMapped = TRUE,
minMQS = 30,
strandSpecific = 2,
nthreads = 4) ##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.4.3
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 9 BAM files ||
## || ||
## || basal_R1.bam ||
## || basal_R2.bam ||
## || basal_R3.bam ||
## || indust1_R1.bam ||
## || indust1_R2.bam ||
## || indust1_R3.bam ||
## || indust2_R1.bam ||
## || indust2_R2.bam ||
## || indust2_R3.bam ||
## || ||
## || Paired-end : yes ||
## || Count read pairs : yes ||
## || Annotation : Saccharomyces_cerevisiae_genome.gff3 (GTF) ||
## || Dir for temp files : . ||
## || Threads : 4 ||
## || Level : meta-feature level ||
## || Multimapping reads : counted ||
## || Multi-overlapping reads : not counted ||
## || Min overlapping bases : 1 ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file Saccharomyces_cerevisiae_genome.gff3 ... ||
## || Features : 6600 ||
## || Meta-features : 6600 ||
## || Chromosomes/contigs : 17 ||
## || ||
## || Process BAM file basal_R1.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 763960 (76.4%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file basal_R2.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 754176 (75.4%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file basal_R3.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 767728 (76.8%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust1_R1.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 770775 (77.1%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust1_R2.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 760137 (76.0%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust1_R3.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 756489 (75.6%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust2_R1.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 761090 (76.1%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust2_R2.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 760304 (76.0%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file indust2_R3.bam... ||
## || Strand specific : reversely stranded ||
## || Paired-end reads are included. ||
## || Total alignments : 1000000 ||
## || Successfully assigned alignments : 756364 (75.6%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Write the final count table. ||
## || Write the read assignment summary. ||
## || ||
## \\============================================================================//
El objeto data es una lista que contiene varios elementos. Las cuentas de lecturas se han almacenado en ‘counts’.
names(data)## [1] "counts" "annotation" "targets" "stat"
# Extract read counts into a new object
raw_counts = data$counts
# Example : read counts for a given gene over one condition (three bio. reps)
raw_counts["gene:YAL063C", 7:9]## indust2_R1.bam indust2_R2.bam indust2_R3.bam
## 0 0 0
Exercise : Does the mapping match what I see in IGV ?
DESeq workflow
Como punto de partida, DESeq requiere dos objetos con la siguiente información :
- raw counts of the number of reads aligning each gene
- associated sample metadata
Comenzaremos creando una tabla de metadatos donde se recoja la información del experimento. Un ejemplo puede verse aquí.
# Create sample vector
sampleID = paste0("SRR93364", 68:76)
# Create condition vector
conditions = rep(c("basal", "indust1", "indust2"), each = 3)
# Create description vector
description = rep(c("pH5_0.04CO2", "pH5_5CO2", "pH3_0.04CO2"), each = 3)
# Create data frame
saccha_metadata <- data.frame(sampleID, conditions, description)
# Assign the row names of the data frame
rownames(saccha_metadata) = paste(conditions, paste0("R",1:3), sep = "_")
# Save the df into a csv file
dir.create("dea/data")
write.csv(saccha_metadata, file = "dea/data/saccha_metadata.csv")DESeq2 requires the sample names in the metadata and counts dataset to be in the same order. Therefore, the row names in the metadata need to be in the same order as the column names of the count data (= mismo orden que los ficheros bam).
# column names of the count data
colnames(raw_counts) = gsub(".bam", "", colnames(raw_counts))
# row names of the metadata
rownames(saccha_metadata)## [1] "basal_R1" "basal_R2" "basal_R3" "indust1_R1" "indust1_R2"
## [6] "indust1_R3" "indust2_R1" "indust2_R2" "indust2_R3"
# check identity
identical(colnames(raw_counts), rownames(saccha_metadata))## [1] TRUE
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
colData = saccha_metadata,
design = ~ conditions)## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
Normalize raw counts for library size.
The raw counts for each sample are divided by the associated sample-specific size factor for normalization. To view the size factors used for normalization, we can use the sizeFactors function.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)## basal_R1 basal_R2 basal_R3 indust1_R1 indust1_R2 indust1_R3 indust2_R1
## 1.0575946 1.0486980 1.2369258 1.2574809 1.0282855 1.0269234 0.9417575
## indust2_R2 indust2_R3
## 0.8067719 0.8060861
norm_counts <- counts(dds, normalized = TRUE)
# Histogram
par(mfrow=c(1,2))
boxplot(log2(counts(dds, normalized = FALSE)+1),
cex = 0.5, cex.axis = .9, main = "non-normalized libraries")
boxplot(log2(norm_counts+1), cex = 0.5, cex.axis = .9,
main = "normalized libraries")# Scatterplot : log normalized counts against each other
log.norm.counts = log2(norm_counts + 1)
plot(log.norm.counts[,1:2], cex=.1)Unsupervised Clustering Analyses.
# Transform the normalized counts : VST-transformed normalized counts
vst_counts <- vst(dds, blind = TRUE)
# Extract the matrix of transformed counts
vsd <- assay(vst_counts)
# Compute the correlation values between samples
vsd_cor <- cor(vsd) # Hierarchical analysis
BiocManager::install("pheatmap")
pheatmap::pheatmap(vsd_cor, annotation = dplyr::select(saccha_metadata, conditions))# Plot PCA
plotPCA(vst_counts, intgroup = "conditions")Model fitting.
Run the DESeq2 analysis : model fitting for gene expression data.
dds <- DESeq(dds)# Plot dispersions
plotDispEsts(dds)Model contrast.
Run the DESeq2 model contrast to extract the results.
P5C5vsP5C004 <- results(dds,
contrast=c("conditions", "indust1", "basal"),
alpha = 0.05)
plotMA(P5C5vsP5C004, colSig = "red")Save results as a data frame.
write.csv(as.data.frame(P5C5vsP5C004),
file="dea/results/P5C5vsP5C004_DEG.csv")Results table.
DESeq2 results table.
mcols(P5C5vsP5C004)## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: cond..
## stat results Wald statistic: cond..
## pvalue results Wald test p-value: c..
## padj results BH adjusted p-values
summary(P5C5vsP5C004)##
## out of 6244 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 37, 0.59%
## LFC < 0 (down) : 42, 0.67%
## outliers [1] : 0, 0%
## low counts [2] : 966, 15%
## (mean count < 9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
head(P5C5vsP5C004)## log2 fold change (MLE): conditions indust1 vs basal
## Wald test p-value: conditions indust1 vs basal
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## gene:YAL069W 0 NA NA NA NA NA
## gene:YAL068W-A 0 NA NA NA NA NA
## gene:YAL068C 0 NA NA NA NA NA
## gene:YAL067W-A 0 NA NA NA NA NA
## gene:YAL067C 0 NA NA NA NA NA
## gene:YAL066W 0 NA NA NA NA NA
Run a model contrast setting a log2 threshold to extract genes with biological relevance.
P5C5vsP5C004 <- results(dds,
contrast=c("conditions", "indust1", "basal"),
alpha = 0.05,
lfcThreshold = 0.32) # 1.25-fold-change## Número de genes diferencialmente expresados (P < 0.05) : 26
## Número de genes diferencialmente sobre-expresados (P < 0.05) : 7
## Número de genes diferencialmente inhibidos (P < 0.05) : 19
Resultados genes sobre-expresados :
subset(P5C5vsP5C004, log2FoldChange > 0 & padj < 0.05)## log2 fold change (MLE): conditions indust1 vs basal
## Wald test p-value: conditions indust1 vs basal
## DataFrame with 7 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## gene:YER065C 219.8343 1.41929 0.228886 4.80281 1.56455e-06
## gene:YHR216W 31.6235 1.99321 0.348451 4.80186 1.57199e-06
## gene:YKL029C 77.5936 1.49898 0.301234 3.91383 9.08451e-05
## gene:YKR097W 116.2186 3.02692 0.292078 9.26781 1.90016e-20
## gene:YLR377C 119.0889 2.64950 0.302826 7.69254 1.44243e-14
## gene:YOL155C 692.0430 3.29114 0.588143 5.05173 4.37817e-07
## gene:YPL156C 165.8731 1.29763 0.246557 3.96515 7.33496e-05
## padj
## <numeric>
## gene:YER065C 6.18391e-04
## gene:YHR216W 6.18391e-04
## gene:YKL029C 2.18168e-02
## gene:YKR097W 3.95486e-17
## gene:YLR377C 1.80131e-11
## gene:YOL155C 2.73373e-04
## gene:YPL156C 1.99128e-02
Resultados genes inhibidos :
subset(P5C5vsP5C004, log2FoldChange < 0 & padj < 0.05)## log2 fold change (MLE): conditions indust1 vs basal
## Wald test p-value: conditions indust1 vs basal
## DataFrame with 19 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## gene:YBR068C 630.234 -1.98563 0.230943 -7.21232 5.50045e-13
## gene:YBR117C 194.545 -1.25507 0.236647 -3.95135 7.77120e-05
## gene:YDR461W 820.879 -1.97600 0.198064 -8.36096 6.22091e-17
## gene:YIL117C 114.665 -1.35399 0.263720 -3.92079 8.82597e-05
## gene:YIL015W 121.557 -1.56497 0.250900 -4.96202 6.97627e-07
## ... ... ... ... ... ...
## gene:YNR030W 138.0639 -1.28392 0.240581 -4.00661 6.15957e-05
## gene:YNR044W 99.6903 -1.59479 0.265567 -4.80026 1.58460e-06
## gene:YPL223C 844.8084 -1.15058 0.198634 -4.18147 2.89624e-05
## gene:YPL187W 156.2926 -3.33751 0.736083 -4.09942 4.14182e-05
## gene:YPL058C 1130.3495 -4.14848 0.303836 -12.60049 2.09849e-36
## padj
## <numeric>
## gene:YBR068C 5.72414e-10
## gene:YBR117C 2.02181e-02
## gene:YDR461W 9.71084e-14
## gene:YIL117C 2.18168e-02
## gene:YIL015W 3.95998e-04
## ... ...
## gene:YNR030W 1.74820e-02
## gene:YNR044W 6.18391e-04
## gene:YPL223C 9.51797e-03
## gene:YPL187W 1.23150e-02
## gene:YPL058C 6.55148e-33
Visualización de Resultados.
Visualización : MA plot
plotMA(P5C5vsP5C004, colSig = "red")Visualización : heatmap
# Subset normalized counts to significant genes
degs <- rownames(subset(P5C5vsP5C004, padj < 0.05))
sig_norm_counts <- norm_counts[degs, ] # tabla final para heatmap
# Choose a color palette from RColorBrewer
library(RColorBrewer)
heat_colors <- brewer.pal(6, "YlOrRd")
pheatmap::pheatmap(sig_norm_counts,
color = heat_colors,
cluster_rows = TRUE,
show_rownames = FALSE,
annotation = dplyr::select(saccha_metadata, conditions),
scale = "row")DEGs annotation.
Download genome annotation in tabular format from NCBI.
# Download genome annotation
download.file("https://www.dropbox.com/s/clp3eson00mlca3/sacCer_annotation.csv?dl=1",
destfile = "dea/data/saccha_annotation.csv")# Read the annotation
annot <- read.csv("dea/data/saccha_annotation.csv") La tabla de DEGs la obtenemos del siguiente modo :
deg_res <- subset(P5C5vsP5C004, padj < 0.05)[degs, ]El objeto deg_res es de tipo DESeq2 pero necesitamos convertirlo a data.frame para aplicar una función que definiremos a continuación.
class((deg_res))## [1] "DESeqResults"
## attr(,"package")
## [1] "DESeq2"
deg_res = as.data.frame(deg_res)
class((deg_res))## [1] "data.frame"
Call the function final_table and get a final table of DEGS with their annotation.
res_final <- final_table(deg_res)
res_final %>% arrange(desc(log2FoldChange))## # A tibble: 26 × 16
## gene Protein.Name baseMean log2FoldChange lfcSE stat pvalue padj
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 YOL155C mannoprotein 692. 3.29 0.588 5.05 4.38e- 7 2.73e- 4
## 2 YKR097W phosphoenolpyr… 116. 3.03 0.292 9.27 1.90e-20 3.95e-17
## 3 YLR377C fructose 1,6-b… 119. 2.65 0.303 7.69 1.44e-14 1.80e-11
## 4 YHR216W IMP dehydrogen… 31.6 1.99 0.348 4.80 1.57e- 6 6.18e- 4
## 5 YKL029C malate dehydro… 77.6 1.50 0.301 3.91 9.08e- 5 2.18e- 2
## 6 YER065C isocitrate lya… 220. 1.42 0.229 4.80 1.56e- 6 6.18e- 4
## 7 YPL156C pheromone-regu… 166. 1.30 0.247 3.97 7.33e- 5 1.99e- 2
## 8 YPL223C Gre1p 845. -1.15 0.199 -4.18 2.90e- 5 9.52e- 3
## 9 YBR117C transketolase … 195. -1.26 0.237 -3.95 7.77e- 5 2.02e- 2
## 10 YNR030W dolichyl-P-Man… 138. -1.28 0.241 -4.01 6.16e- 5 1.75e- 2
## # ℹ 16 more rows
## # ℹ 8 more variables: chr <chr>, Accession <chr>, Start <int>, Stop <int>,
## # Strand <chr>, Locus <chr>, Protein.product <chr>, Length <int>